Variant standardization engine

ABSTRACT

The invention provides a system and method for searching for information in an electronic document, a website or the Internet. The system first standardizes the primary entry entered by the user, then matches the standardized entry to a categorically unique referent in a database, and finally identifies the variants of the categorically unique referent and reports all or some of the variants to the search module as search queries.

This application claims priority to the U.S. provisional patent application Ser. No. 60/585,296, filed on 2 Jul. 2004, the contents of which are incorporated by reference herein.

BACKGROUND OF THE INVENTION

1. Field of the Invention

This invention relates generally to electronic searching technology. More particularly, the invention relates to a system and method for conducting various automatic steps of dialectal/variant standardization in a web-based search engine.

2. Description of Prior Art

The World Wide Web is a fast expanding terrain of information available via the Internet. The sheer volume of documents available on different sites on the World Wide Web (“Web”) calls for efficient search tools for quick search and retrieval of relevant information. In this context, search engines assume great significance because of their utility as search tools that help users search for and retrieve specific information from the Web by using keywords, phrases or queries.

A whole array of search tools, such as Google, Yahoo, AltaVista, Excite, HotBot, Lycos, Infoseek, Overture, and WebCrawler, are available these days for users to choose from in conducting their search. However, search tools are not all the same. They differ from one another primarily in the manner in which they index information or web sites in their respective databases, each using an algorithm peculiar to that search tool. It is important to know the difference between the various search tools because, while each search tool performs the common task of searching and retrieving information, each one accomplishes the task differently. Hence the search results from different search engines differ even when the same phrases or queries are entered.

Search tools of different kinds fall broadly into five categories: directories, search engines, super engines, meta search engines, and special search engines.

A search engine allows searching of searchable online databases. It has several components: search engine software, spider software, an index (database), and a relevancy algorithm (rules for ranking). The search engine software consists of a server or a collection of servers dedicated to indexing Internet Web pages, storing the results and returning lists of pages to match user queries. The spider software constantly crawls the Web, collecting Web page data for the index. The index is a database for storing the data. The relevancy algorithm determines how to rank the results returned for a query. A search engine generally includes features such as Boolean operators, search fields, display format, etc.

Search tools like Yahoo, Magellan and Look Smart qualify as web directories. Each of these web directories has developed its own database comprising selected web sites. Thus, when a user uses a directory like Yahoo to perform a search, he is searching the database maintained by Yahoo and browsing its contents.

Search engines like Infoseek, WebCrawler and Lycos use software programs such as “Web crawlers”, “spiders” or “robots” that crawl around the Web and index and catalogue the contents of different web sites into the database of the search engine itself. Web crawler programs are a subset of software agents, programs with an unusual degree of autonomy which perform tasks for the user. These agents normally start with a historical list of links, such as server lists and lists of the most popular or best sites, and follow the links on these pages to find more links to add to the database.

A more sophisticated class of search engines includes super engines, which use software similar to “Web crawlers”, “robots” or “spiders.” However, they differ from ordinary search engines because they index keywords appearing not only in the title but anywhere in the text of the site content. Excite, OpenText, HotBot and AltaVista are examples of super engines.

A meta search engine is a search engine that queries other search engines and then combines the results received from all of them. A user using a meta search engine actually browses through a whole set of search engines contained in the database of the meta search engine. Dogpile and Savvy Search are examples of meta search engines.

Special search engines are another type of search engine, catering to the needs of users seeking information on particular subject areas. Deja News and Infospace are examples of special search engines.

Thus, each one of these search tools is unique in terms of the way it performs a search and works towards fulfilling the common goal of making resources on the web available to users. Most search engines allow users to type in a few words, and then search for occurrences of these words in their database. Each one has a special way of deciding what to do about approximate spellings, plural variations, and truncation.

These search engines share a common imperfection: the results returned for different queries that have the same meaning are inconsistent. For example, at Google, the search results of “best cab-driver in New York” and “best taxi-driver in New York” are different. At Yahoo, the search results of “icebox”, “refrigerator”, “fridge” and “Frigidaire” are different. For the same categorical referent, it is imperative to have the same search results. Search is about comprehensiveness as well as relevancy. A layman user is entitled to the search results that are available to the well educated. There should be a mechanism to make the search results of “contusion” available to laymen searching for “bruise”. Midwesterners familiar with terms of a bygone era, such as “Frigidaire”, should be able to find, for the same categorical referent, the relevant search results of “refrigerator”.

Accordingly, it would be desirable to provide a system and method for automatically standardizing the entry.

SUMMARY OF THE INVENTION

The present invention, defined by the appended claims with the specific embodiments shown in the attached drawings, is directed to a system and method that enables a search engine to return identical search results in responding to various entries which belong to a same categorically unique referent. The system first standardizes the primary entry entered by the user and then matches the standardized entry to a categorically unique referent in a database, and then identifies the variants of the categorically unique referent and reports all or some of the variants to the search module as search queries.

In accordance with this invention, the user's entry for search is automatically pre-treated as one or more queries based on linguistic standardization and/or optimization. The linguistic standardization is based on the concept of a categorically unique referent (CUR). Each categorical word belongs to a CUR. Each CUR may include a number of variants across dialects, or across regional or socio-economic class variations of a same dialect. When the user enters any variant of the CUR, the returned search results will be the same. To meet the user's special needs, the system allows the user to set a language background before conducting a search and to choose a search mode from full search, optimized search and precise search.

In one preferred embodiment, the invention provides an application that runs in a local computer or a local network. Using this application, the user may conduct a search through the documents stored in the computer or the network.

In another preferred embodiment, the invention provides an application that runs in a website server. Upon entering the website, the user may conduct a search through all pages available in the website.

In another preferred embodiment, the invention provides an application that runs in a web-based search engine's host server. Upon entering the website of the host, the user may conduct a search through all searchable information available on the Internet.

The foregoing has outlined, rather broadly, the more pertinent and important features of the present invention. The detailed description of the invention that follows is offered so that the present contribution to the art can be more fully appreciated.

BRIEF DESCRIPTION OF THE DRAWINGS

For a more complete understanding of the nature and objects of the present invention, reference should be directed to the following detailed description taken in connection with the accompanying drawings in which:

FIG. 1 is a schematic diagram illustrating a computer environment wherein the preferred embodiment of this invention operates;

FIG. 2 is a block diagram illustrating the basic steps of the process according to this invention;

FIG. 3 is a schematic block diagram illustrating an application running on a local computer according to one preferred embodiment of this invention;

FIG. 4 is a schematic diagram illustrating the operations of D/V standardization according to FIG. 2 and FIG. 3;

FIG. 5A and FIG. 5B are two schematic flow diagrams illustrating a method according to the preferred embodiment of FIG. 3;

FIG. 6 is a schematic diagram illustrating an exemplary utilization of the invention in a website's server;

FIG. 7 is a schematic block diagram illustrating the operations according to FIG. 6;

FIG. 8 is a schematic flow diagram illustrating a method according to the preferred embodiment of FIG. 6 and FIG. 7;

FIG. 9 is a schematic diagram illustrating an exemplary utilization of the invention in a Web-based search engine's host;

FIG. 10 is a schematic block diagram illustrating the operations according to FIG. 9; and

FIG. 11 is a schematic flow diagram illustrating a method according to the preferred embodiment of FIG. 9 and FIG. 10.

DETAILED DESCRIPTION OF THE INVENTION

With reference to the drawings, the present invention will now be described in detail with regard for the best mode and the preferred embodiments. In its most general form, the invention comprises a program storage medium readable by a computer, tangibly embodying a program of instructions executable by the computer to perform the steps necessary to standardize the search query entered by a user, such that when any variant of the standard search query is entered, an identical search result will be returned.

FIG. 1 is a block diagram illustrating the computer environment in which one of the preferred embodiments of this invention operates. The computer environment includes a computer platform 101 which includes a hardware unit 102 and an operating system 103. The hardware unit 102 includes at least one central processing unit (CPU) 104, a read-only memory (ROM) 105 for storing application programs, a random access memory (RAM) 106 available for the application programs' operations, and an input/output (IO) interface 107. Various peripheral components are connected to the computer platform 101, such as a data storage device 108 and a terminal 109. A search application 100, adapted to a data processing application 110 which supports a searchable document, such as Word, WordPerfect or Microsoft Excel, runs on the computer platform 101. Those skilled in the art will readily understand that the invention may be implemented within other systems without fundamental changes.

As illustrated in FIG. 2, the process according to the present invention takes place in three stages: Dialectal/Variant Standardization 111, search on the variants of the D/V standardized entry 112, and display of search results 113.

FIG. 3 is a schematic block diagram illustrating one preferred embodiment of the present invention. The Dialectal/Variant Standardization Engine (hereinafter DVSE) application 100 is incorporated in a data processing application which supports searchable documents. A user who opens a document 126 may conduct a search via a graphical user interface (GUI) 120 displayed on the user's screen 130. The user uses a language background setting means 121 to set a language background from a number of choices such as current locale, parents' native tongue, schooling dialect, social dialect, and most comfortable dialect. The language background setting means 121 can be a dropdown list or a number of hyperlinked icons, each of which represents an option. Typically, the user selects one option. However, the system can be configured to enable the user to choose two or more at the same time. The default language background is preset by the manufacturer but can be reset by the user. The default language background can also be configured as the language background that the user used last time. In that case, the user does not need to set the language background every time he activates the DVSE application.

The D/V Standardization Module 111 a is a program capable of screening, analyzing, and transforming a non-common use query, such as a slang phrase, a dialect phrase, teen language, or a specialized term in medicine, chemistry, botany, etc., into a common use query or standardized query. For example, it knows to group together auto, automobile, vehicle, etc., and standardizes the input through statistical abstraction and fuzzy logic. The standardization is based on the concept of a “categorically unique referent”. Linguistic studies indicate that each categorical word belongs to a categorically unique referent (CUR) and each CUR has a number of variants. The number of variants changes from time to time with the evolution of the language. Among these variants, some are equivalent, but others may differ slightly in relevancy.

After a standardized entry is determined, the D/V Standardization Module 111 a consults the Database 111 b, which includes a relevancy algorithm and a number of ranking rules. Then, the D/V Standardization Module 111 a determines the scope of variants to be chosen. In the preferred embodiments of this invention, the scope of variants is presented as three basic modes: full search mode, optimized search mode, and precise search mode. In the full search mode, the D/V Standardization Module 111 a presents all or substantially all of the identified variants of a CUR to the Search Module 125, which treats each of the variants as a query and performs a search on each of the variants. In the optimized search mode, the D/V Standardization Module 111 a presents only some of the variants of the CUR. These variants are called reportable variants. When the optimized search mode is chosen, the D/V Standardization Module 111 a screens all variants of the CUR and chooses some of them based on relevancy or other values associated with a variant. In the precise search mode, the D/V Standardization Module 111 a disables the CUR function and presents only the user's entry to the Search Module 125. If no result is found corresponding to the entry, the system prompts the user to change the entry.
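By way of non-limiting illustration, the three search modes may be sketched in Python as follows. The function name `select_variants`, the relevancy scores, and the 0.5 threshold are assumptions made for this sketch only; the specification does not prescribe how relevancy values are computed or what cut-off the optimized mode applies.

```python
from enum import Enum


class SearchMode(Enum):
    FULL = "full"
    OPTIMIZED = "optimized"
    PRECISE = "precise"


def select_variants(entry, variants, mode, min_relevancy=0.5):
    """Return the queries that would be handed to the search module.

    `variants` maps each variant of the matched CUR to a relevancy score
    in [0, 1]. Both the scores and `min_relevancy` are hypothetical.
    """
    if mode is SearchMode.PRECISE:
        # CUR function disabled: only the user's own entry is searched.
        return [entry]
    if mode is SearchMode.FULL:
        # All identified variants are reported.
        return list(variants)
    # OPTIMIZED: report only the variants that pass the relevancy screen.
    return [v for v, score in variants.items() if score >= min_relevancy]


# Hypothetical relevancy values for the "refrigerator" referent.
refrigerator = {"refrigerator": 1.0, "fridge": 0.9, "icebox": 0.4, "Frigidaire": 0.3}
```

With these assumed scores, an entry of “icebox” yields all four variants in full mode, “refrigerator” and “fridge” in optimized mode, and only “icebox” in precise mode.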

FIG. 4 is a schematic diagram illustrating the operations of D/V Standardization according to FIG. 2 and FIG. 3. In this example, if the user enters any of bike, cycle, bicycle, tandem, or a misspelling such as bycicle, the D/V Standardization Module 111 a and the Database 111 b will first standardize the entry as “bicycle”, which represents a CUR. Then, the D/V Standardization Module 111 a pulls out the full listing of the variants of the CUR “bicycle”. In this example, the full listing of the variants of the CUR “bicycle” includes “bicycle”, “cycle”, “bike” and “tandem”. If the full search mode is chosen, the D/V Standardization Module 111 a will report all these variants to the Search Module 125. If the optimized search mode is chosen, the D/V Standardization Module 111 a will perform an optimization step on the CUR's variants to select some of them based on relevancy and other predetermined rules. In this example, because “tandem” is much less frequently used in daily life, the D/V Standardization Module 111 a only selects and reports “bicycle”, “bike” and “cycle” to the Search Module 125. If the precise search mode is chosen, and if the user enters “tandem”, then the D/V Standardization Module 111 a will directly report “tandem” to the Search Module 125.
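The standardization step of this example may be sketched as follows. This is a minimal illustration and not the module's actual algorithm: the table `CUR_VARIANTS`, the function `standardize`, and the use of approximate string matching from Python's standard `difflib` module (in place of the statistical abstraction and fuzzy logic mentioned above) are all assumptions of the sketch.

```python
import difflib

# Hypothetical CUR table: CUR headword -> its known variants.
CUR_VARIANTS = {
    "bicycle": ["bicycle", "cycle", "bike", "tandem"],
}

# Reverse index from each variant to the CUR it belongs to.
VARIANT_TO_CUR = {v: cur for cur, vs in CUR_VARIANTS.items() for v in vs}


def standardize(entry, cutoff=0.6):
    """Map a user's entry to its CUR headword, or None if nothing is close.

    Exact variants are looked up directly; misspellings fall back to the
    closest known variant whose similarity ratio is at least `cutoff`.
    """
    word = entry.strip().lower()
    if word in VARIANT_TO_CUR:
        return VARIANT_TO_CUR[word]
    close = difflib.get_close_matches(word, list(VARIANT_TO_CUR), n=1, cutoff=cutoff)
    return VARIANT_TO_CUR[close[0]] if close else None
```

Under these assumptions, `standardize("Bike")` and `standardize("bycicle")` both resolve to the CUR “bicycle”, while an unrelated string resolves to `None`, which corresponds to the case in which the user is prompted to change the entry.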

The D/V standardization is an essential step because the words encountered often have several different dialectal variations. A language such as English is itself full of dialectal variations in the form of British English, American English, Canadian English, Australian English, Indian English, African English, etc. Good examples of dialectal variations between British English and American English include centre vs. center, lorry vs. truck, queue vs. line, and petrol vs. gasoline. Similar instances could be cited in many of the other languages of the world, too. In Chinese, for example, there are as many as forty-five different dialectal variations for just one particular word. Such instances corroborate the fact that dialectal variations are the rule rather than the exception, and therefore the only way to counter them is by standardizing a query or a word to a commonly known word. Even within the same dialect, a CUR may have variants in different semantic regions, such as technical vs. laymen terms, historical vs. current, slang vs. standard, vernacular vs. bookish, regional dialect, personal regional variant due to migration, professional vs. laymen, academic vs. general, Latin origin vs. current usage, brand default generic terms, first maker default generic terms, best maker default generic terms, traditional vs. simplified, acronym vs. full, abbreviations, different versions of transliterations, borrowings, etc.

In the preferred embodiments of this invention, if the D/V standardization module fails to recognize the word and thus is unable to perform dialectal/variant standardization, a query prompter unit may prompt the user for more input or request the user to choose from a set of expressions to assist, clarify, and sharpen his query. In that case the user may submit another query to the query input means. Such a query may be either a standard term or a non-standard term. For example, different variants of the word “auto”, including automobile and transportation vehicle, are permitted to be input by the user as part of the dialectal/variant standardization process.

The D/V Standardization Module 111 a and the Database 111 b may be updated from time to time by incorporating the most recent linguistic discoveries and research results in areas such as fuzzy logic, rules of word formation, laws and pressures from spontaneous innovations, interpretation of statistics, philology, diachronic studies of lexical diffusion, borrowing patterns, genetic relation of language families at different depths of time, etymology, core vocabulary and its manifestation, ease of physical reproduction, and cognitive science (human information processing), etc.

The updating work can be done manually by programmers based on proposals from linguists. In this situation, the manufacturers or providers will issue new versions of the application (including the database) to keep up with social and linguistic changes. The updating work can also be done by automatic means. For example, the D/V standardization module and the database may be associated with a Web-based electronic survey program. The program collects words, calculates the use frequency and other values of each word, and constantly updates the database. The program also enables experienced dialectologists, in different geographical regions, to monitor and input variants of the same referent and keywords into the system, where principal editors calculate, evaluate, and standardize reports of sightings, recordings, and hearsay of word usage. The coverage includes technical vs. laymen terms, historical vs. current, slang vs. standard, vernacular vs. bookish, regional dialect, personal regional variant due to migration, professional vs. laymen, academic vs. general, Latin origin vs. current usage, brand default generic terms, first maker default generic terms, best maker default generic terms, traditional vs. simplified, acronym vs. full, abbreviations, different versions of transliterations, borrowings, etc.
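As a hedged sketch of the automatic updating described above, the use frequency of each variant could be maintained with a simple counter. The function names `record_usage` and `relative_frequency` are hypothetical, and the specification does not state which statistics the survey program actually computes.

```python
from collections import Counter


def record_usage(counts, observed_words):
    """Add newly collected words (case-folded) to the running usage counts."""
    counts.update(word.lower() for word in observed_words)
    return counts


def relative_frequency(counts, variants):
    """Return each variant's share of all observed uses of the CUR's variants."""
    total = sum(counts[v] for v in variants)
    return {v: (counts[v] / total if total else 0.0) for v in variants}


# Hypothetical survey sample.
counts = record_usage(Counter(), ["bike", "Bike", "bicycle", "tandem"])
shares = relative_frequency(counts, ["bicycle", "bike", "cycle", "tandem"])
```

Values of this kind could then feed the relevancy rules used by the optimized search mode, so that rarely used variants drop out of the reportable set as usage shifts.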

FIG. 5A and FIG. 5B are two schematic flow diagrams illustrating a method 170 according to the preferred embodiment of FIG. 3. The method includes the steps of:

Step 171: The user enters a query.

Step 172: The system conducts a primary D/V standardization on the query, i.e. standardizes the query based on the D/V rules.

Step 173: The system tries to match the standardized query to a categorically unique referent (CUR) stored in the CUR database.

Step 178: If the standardized query fails to match a CUR in the database, the user will be prompted to change the query. A red flag mechanism will be used to alert editor-linguists and/or supervising editor-linguists that there might be a need to create a new CUR, as new words, such as blog or bread machine, or new sub-units, such as auto-parts, emerge now and then, here and there, calling for linguistic community consensus.

Step 174: In a full search mode, if the standardized query does match a CUR in the database, the system lists and reports all the variants associated with the CUR.

Step 175: Search on each of the variants.

Step 176: Return the search results in an order according to relevancy or other values.

Optionally, if an optimized search is set, Step 173 continues with the following steps:

Step 174 a: In an optimized search mode, if the standardized query does match a CUR in the database, the system lists and reports one or more variants associated with the CUR based on the rules of preference.

Step 175 a: Search on each of the selected variants.

Step 176 a: Return the search results in an order according to relevancy or other rules.
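The steps of method 170 may be sketched end to end as follows. This is an illustrative sketch only: the function `run_method_170`, the in-memory CUR database, the whole-word document search, the relevancy scores, and the 0.5 threshold are all assumptions, and standardization is reduced here to trimming and lower-casing.

```python
def run_method_170(query, cur_db, documents, mode="full", red_flags=None, threshold=0.5):
    """Return matching documents in relevancy order, or None if no CUR matches.

    `cur_db` maps a CUR headword to {variant: relevancy score}.
    """
    standardized = query.strip().lower()                      # Step 172 (stand-in)
    cur = next((name for name, variants in cur_db.items()
                if standardized in variants), None)           # Step 173
    if cur is None:                                           # Step 178
        if red_flags is not None:
            red_flags.append(standardized)  # alert the editor-linguists
        return None                         # caller prompts the user to change the query
    # Step 174: list all variants, most relevant first.
    ranked = sorted(cur_db[cur].items(), key=lambda item: item[1], reverse=True)
    if mode == "optimized":                                   # Step 174 a
        ranked = [(v, s) for v, s in ranked if s >= threshold]
    results = []
    for variant, _score in ranked:                            # Steps 175 / 175 a
        for doc in documents:
            if variant in doc.lower().split() and doc not in results:
                results.append(doc)
    return results                                            # Steps 176 / 176 a


cur_db = {"bicycle": {"bicycle": 1.0, "bike": 0.9, "cycle": 0.7, "tandem": 0.2}}
documents = ["Bike lanes downtown", "Tandem rentals", "Bicycle repair shop", "Car wash"]
```

Because the variants are searched in descending order of relevancy, the result list is already ordered as Step 176 requires. A query such as “blog”, which matches no CUR in this toy database, returns `None` and is appended to `red_flags` for editorial review.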

FIG. 6 is a schematic diagram illustrating an exemplary utilization of the invention in a website's server. The application is installed in the website server 201. Upon entering the website's main page, the user may search all pages in the website by entering a keyword via the interface 202. FIG. 7 is a schematic block diagram illustrating the operations according to FIG. 6. Before the user initiates a search, he may set the language background 221 and set the search mode 222 in the graphical user interface 202. The user enters a keyword as the query. When he starts the search by clicking the “GO” button, the query is sent to the D/V Standardization Module 224. The D/V Standardization Module 224 first standardizes the query based on a number of linguistic rules in connection with the selected language background, and then looks up the Database 225 to match the standardized query to a CUR. Then, in accordance with the selected search mode, the D/V Standardization Module 224, together with the Database 225, reports all or some preferred variants of the CUR to the Search Module 226. Then, the Search Module 226 returns the search results 229 to the user via the Display Control 228 and the graphical user interface 202.

FIG. 8 is a schematic flow diagram illustrating a method according to the preferred embodiment of FIG. 6 and FIG. 7. The method includes the following steps:

Step 251: Access a DVSE enabled website which is in an object language.

Step 252: Select a subject language (which is the user's most comfortable language).

Step 253: Enter a query in the subject language.

Step 254: Standardize the query in the subject language.

Step 255: Translate the standardized query into the object language.

Step 256: Match the translated query to a CUR.

Step 257: Search all or some of the preferred variants of the CUR.
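The subject-language and object-language steps above may be sketched as follows. The lookup tables, the language codes, and the function `cross_language_queries` are hypothetical and stand in for whatever standardization rules and translation facility an implementation would use. In this sketch the subject language is Spanish and the object language is English.

```python
# Hypothetical subject-language standardization table (Spanish).
SUBJECT_STANDARD = {"es": {"bici": "bicicleta", "bicicleta": "bicicleta"}}

# Hypothetical dictionary from standardized subject terms to object-language terms.
TRANSLATION = {("es", "en"): {"bicicleta": "bicycle"}}

# Hypothetical object-language CUR table: headword -> variants.
OBJECT_CURS = {"en": {"bicycle": ["bicycle", "bike", "cycle", "tandem"]}}


def cross_language_queries(entry, subject, obj):
    """Return the object-language variants to search, or [] if any step fails."""
    standardized = SUBJECT_STANDARD[subject].get(entry.strip().lower())  # standardize
    if standardized is None:
        return []
    translated = TRANSLATION[(subject, obj)].get(standardized)           # translate
    variants = OBJECT_CURS[obj].get(translated)                          # match to a CUR
    return list(variants) if variants else []                            # variants to search
```

Under these assumptions, a Spanish-speaking user who enters the colloquial “bici” on an English-language site would have the site searched for “bicycle”, “bike”, “cycle” and “tandem”.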

FIG. 9 is a schematic diagram illustrating another exemplary utilization of the invention in a Web-based search engine's host. The application is installed in the website server 301 and runs across the Internet 304. Upon entering the host's main page, the user may search across the Internet by entering a keyword via the interface 302. FIG. 10 is a schematic block diagram illustrating the operations according to FIG. 9. Before the user initiates a search, he may set the language background 321 and set the search mode 322 in the graphical user interface 302. The user enters a keyword as the query. When he starts the search by clicking the “GO” button, the query is sent to the D/V Standardization Module 324. The D/V Standardization Module 324 first standardizes the query based on a number of linguistic rules in connection with the selected language background, and then looks up the Database 325 to match the standardized query to a CUR. Then, in accordance with the selected search mode, the D/V Standardization Module 324, together with the Database 325, reports all or some preferred variants of the CUR to the Search Module 326. Then, the Search Module 326 returns the search results 329 to the user via the Display Control 328 and the graphical user interface 302.

FIG. 11 is a schematic flow diagram illustrating a method according to the preferred embodiment of FIG. 9 and FIG. 10. The method includes the following steps:

Step 351: Access the DVSE's main page which is in an object language.

Step 352: Select a subject language (which is the user's most comfortable language).

Step 353: Enter a query in the subject language.

Step 354: Standardize the query in the subject language.

Step 355: Translate the standardized query into the object language.

Step 356: Match the translated query to a CUR.

Step 357: Search all or some of the preferred variants of the CUR.

Although the invention is described herein with reference to the preferred embodiment, one skilled in the art will readily appreciate that other applications may be substituted for those set forth herein without departing from the spirit and scope of the present invention.

Accordingly, the invention should only be limited by the claims included below.

1. A system for searching information on a computer network comprising a computer communicatively coupled to said network, wherein said computer comprises at least one processor, a first memory that stores at least one program used by said at least one processor to perform operations required for the search and a second memory which is available to said at least one program for operation, the system further comprising: a means for standardizing a user's entry; a means for matching the standardized entry to a categorically unique referent which includes one or more variants; and a means for reporting some or all of the variants of said categorically unique referent to a search means; wherein said search means executes a search on each of said reported variants and returns the search results to the user.
2. The system of claim 1, further comprising: a means for setting a search mode from any of: full search mode; optimized search mode; and precise search mode; wherein when said full search mode is set, said reporting means reports all of the variants of said categorically unique referent to said search means; and wherein when said optimized search mode is set, said reporting means only reports one or more preferred variants of said categorically unique referent to said search means in accordance with one or more rules for preference; and wherein when the precise search mode is set, the user's entry is directly reported to said search means.
3. The system of claim 1, further comprising: a means for setting a language background from a number of options.
4. The system of claim 1, wherein said standardizing means applies a set of statistical, logic, linguistic, and/or grammatical rules to the user's entry.
5. The system of claim 1, further comprising: a means for prompting the user to enter a different entry in the event that said matching means fails to match said standardized entry to a categorically unique referent.
6. The system of claim 1, wherein said matching means comprises at least one database for storing categorically unique referents and substantially all variants of each of said categorically unique referents, said at least one database being dynamically updated online.
7. In a computer network comprising a server and at least one client computer communicatively coupled to the server, said server comprising a dialectal/variant standardization module, at least one database, a search engine and a display control module, which in combination perform a process, the process comprising the steps of: standardizing a user's entry; matching the standardized entry to a categorically unique referent which includes one or more variants; and reporting one or more of the variants of said categorically unique referent to a search means; wherein said search means executes a search on each of said reported variants and returns the search results to the user.
8. The method of claim 7, further comprising the step of: setting a search mode from any of: full search mode; optimized search mode; and precise search mode; wherein when said full search mode is set, all of the variants of said categorically unique referent are reported to said search means; and wherein when said optimized search mode is set, only one or more preferred variants of said categorically unique referent are reported to said search means in accordance with one or more rules for preference; and wherein when the precise search mode is set, the user's entry is directly reported to said search means.
9. The method of claim 7, further comprising the step of: setting a language background from a number of options.
10. The method of claim 7, wherein the step for standardizing further comprises a sub-step of: applying a set of statistical, logic, linguistic, and/or grammatical rules to the user's entry.
11. The method of claim 7, further comprising the step of: prompting the user to enter a different entry in the event that said standardized entry fails to match a categorically unique referent.
12. The method of claim 7, further comprising the step of dynamically updating online the database containing categorically unique referents and substantially all variants of each of said categorically unique referents.
13. A computer usable medium containing instructions in computer readable form for carrying out a process for searching information in a computer network, said process comprising the steps of: standardizing a user's entry; matching the standardized entry to a categorically unique referent which includes one or more variants; and reporting one or more of the variants of said categorically unique referent to a search means; wherein said search means executes a search on each of said reported variants and returns the search results to the user.
14. The computer usable medium of claim 13, further comprising the step of: setting a search mode from any of: full search mode; optimized search mode; and precise search mode; wherein when said full search mode is set, all of the variants of said categorically unique referent are reported to said search means; and wherein when said optimized search mode is set, only one or more preferred variants of said categorically unique referent are reported to said search means in accordance with one or more rules for preference; and wherein when the precise search mode is set, the user's entry is directly reported to said search means.
15. The computer usable medium of claim 13, further comprising the step of: setting a language background from a number of options.
16. The computer usable medium of claim 13, wherein the step for standardizing further comprises a sub-step of: applying a set of statistical, logic, linguistic, and/or grammatical rules to the user's entry.
17. The computer usable medium of claim 13, further comprising the step of: prompting the user to enter a different entry in the event that said standardized entry fails to match a categorically unique referent.
18. The computer usable medium of claim 13, further comprising the step of: dynamically updating the database containing categorically unique referents and substantially all variants of each of said categorically unique referents.